Statistical Evaluation of Experimental Determinations of Neutrino Mass Hierarchy 
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Statistical methods of presenting experimental results in constraining the neutrino mass hierarchy 
(MH) are discussed. Two problems are considered and are related to each other: how to report the 
findings for observed experimental data, and how to evaluate the ability of a future experiment to 
determine the neutrino mass hierarchy, namely, sensitivity of the experiment. For the first problem 
where experimental data have already been observed, the classical statistical analysis involves con- 
structing confidence intervals for the parameter Am^ 2 - These intervals are deduced from the parent 
distribution of the estimation of Am^ based on experimental data. Due to existing experimental 
constraints on |Am| 2 |, the estimation of Am| 2 is better approximated by a Bernoulli distribution 
(a Binomial distribution with 1 trial) rather than a Gaussian distribution. Therefore, the Feldman- 
Cousins approach needs to be used instead of the Gaussian approximation in constructing confidence 
intervals. Furthermore, as a result of the definition of confidence intervals, even if it is correctly 
constructed, its confidence level does not directly reflect how much one hypothesis of the MH is 
supported by the data rather than the other hypothesis. We thus describe a Bayesian approach 
that quantifies the evidence provided by the observed experimental data through the (posterior) 
probability that either one hypothesis of MH is true. This Bayesian presentation of observed experi- 
mental results is then used to develop several metrics to assess the sensitivity of future experiments. 
Illustrations are made using a simple example with a confined parameter space, which approximates 
the MH determination problem with experimental constraints on the |Am| 2 |- 
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I. INTRODUCTION 

Neutrino mass hierarchy (MH), i.e. whether the mass 
of the third generation neutrino (1/3 mass eigenstate) is 
greater or less than the masses of the first and the sec- 
ond generation neutrinos {y\ and 1^2), is one of the main 
questions to be answered in the standard model. Be- 
sides its fundamental importance to neutrino oscillation 
physics, the resolution of the neutrino MH plays an im- 
portant role for the search of ncutrinoless double-beta de- 
cay, which would determine whether neutrino is a Dirac 
or Majorana fcrmion. With the recent discovery of a 
large value of sin 2 26>i 3 from Daya Bay T2K @, 

MINOS [|, Double Chooz 0, and RENO @, the stage 
for addressing the neutrino MH has been set. It became 
one of the major goals of current and next generation 
long baseline neutrino experiments (T2K |9[, NO^A [l(| 
and LBNEJljJ) and atmospheric neutrino experiments 
(Super-K [IJTmINOS [l|, PINGU and INO [ljj). 
Meanwhile, the idea of utilizing a reactor neutrino ex- 
periment to determine the MH is also intensively dis- 
cussed [H-ii. 



The objective of this paper is to present appropriate 
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ways to do statistical analysis that will help determine 
the neutrino mass hierarchy. We start by introducing 
a few symbols and state the physics problem in terms 
of a pair of statistical hypotheses. Let mi, m,2 and 7713 
denote the masses of the v\ , v-i and ^3 mass eigenstate 
neutrinos, and let Am?- = mf — m 2 for i, j = 1, 2, 3. As 
reviewed in Ref. [21j, it is known that Am^ > from 
measurements of solar neutrinos given the definition of 
mixing angle 6*12 . Whereas the sign of Am§ 2 is so far un- 
known, and it's common to use NH and IH to denote the 
two hypotheses, the normal hierarchy and the inverted 
hierarchy, respectively: 



NH : Am§ 2 > ; 



IH 



':S2 



< 



(1) 



A unique feature to the above hypotheses testing problem 
is that, there are additional, rather strong information re- 
garding the parameter Am| 2 that need to be taken into 
account properly. Actually, based on previous experi- 
ments, a 68% confidence interval of M| 2 = |Am| 2 | is 
given by (2.43 ± 0.13) x 10~ 3 eV 2 

We will mainly address two aspects of the hypotheses 
testing problem. The first one concerns conducting a test 
after data has been collected. We discuss a classical test- 
ing procedure based on a A\ 2 statistic (Eq.[3J), or cquiva- 
lently, the procedure of constructing confidence intervals 
by inverting the test. As a matter of fact, the classical 
procedure is derived upon the assumption that the best 
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estimator of Am 2 2 based on experimental data would fol- 
low a distribution that is approximately Gaussian. But 
due to existing constraints on M| 2 , this assumption is far 
from being satisfied. Consequently, actual levels of the 
resulting confidence intervals may deviate substantially 
from their nominal levels, as we demonstrate in Sec. [TXJ 
Instead, a general way to construct confidence intervals 
that are true to their nominal levels is the Feldman- 
Cousins approach [23j . which we also illustrate in detail 
in Sec. El 

Still, there is a fundamental limitation to the use of 
confidence intervals. Note that in the MH determination 
problem, one of the most crucial questions is, what is the 
chance that the MH is indeed NH (or IH) given the ob- 
served experimental data? Classical confidence intervals 
are not meant to answer this question directly, whereas 
credible intervals reported by a Bayesian procedure is. In 
Sec. IIII1 we present a Bayesian approach, which effort- 
lessly incorporates prior information on M| 2 and output 
the easy-to-understand (posterior) probability of NH and 
IH to conclude the test. We will emphasize the impor- 
tance to differentiate the Bayesian credible interval from 
the classical confidence interval. 

The second aspect of the hypotheses testing problem 
that we address concerns assessment of experiments in 
their planning stage. It is critical to evaluate the "sen- 
sitivity" of a proposed experiment, i.e., its capability 
to distinguish NH and IH. Since this evaluation is per- 
formed before data collection, it has to be based on po- 
tential data from the experiment. An existing evalua- 
tion method (such as employed in [IJ [24], [25[) assumes 
that the most typical data set under one hypothesis, say 
NH, happens to have been observed. Such a data set 
is referred to as the Asimov data set (26j . The method 
then calculates A\ 2 , which stands for the statistic Ax 2 
in Eq. |3l with the extra bar indicating its de pend ence on 
the Asimov data set. It can be seen that A\ 2 reflects 
how much the Asimov data set under NH disagrees with 
the alternative model, IH. It is then common practice 
to quantify the amount of disagreement by finding the 
p- value corresponding to A% 2 after comparing it to the 
quantiles of a chi-squarc distribution with one degree of 
freedom (choice of MH) . Finally, one minus this p- value 
is sometimes reported as a quantitative assessment of the 
experiment. We will show in Sec. [H] that the comparison 
of the value of Ax 2 to the quantiles of a chi-square distri- 
bution is not justified, when previous knowledge impose 
constraints on the range of possible values of the param- 
eter Am§ 2 . 

As an alternative solution, we adopt a Bayesian frame- 



work and develop a set of new metrics for sensitivity to 
evaluate the potential of experiments to identify the cor- 
rect hypothesis. 

The paper is organized as follows. In Sec. [HI we review 
the steps to construct classical confidence intervals for 
the parameter Am 2 2 - In Sec. IIIIl we describe a Bayesian 
approach that reports the probability of each hypothesis 
of MH given observed data set. We further extend this 
Bayesian method to help assess the sensitivity for future 
experiments. In Sec. IIV1 we illustrate the Bayesian 
approach for a simplified version of the MH problem. 
In particular, analytical formula of the approximations 
for the probability of the hypotheses, and those for 
the sensitivity metrics are provided. Also, a numerical 
comparison is made between the A\ 2 based on the 
Asimov data set and the sensitivity metrics based on the 
Bayesian approach. Finally, discussions and a summary 
are presented in Sec. [V] and Sec. IVI1 respectively. 



II. ESTIMATION IN CONSTRAINED VERSUS 
UNCONSTRAINED PARAMETER SPACES 

In this section, we review a classical statistical proce- 
dure of forming confidence intervals. For the problem of 
determining the neutrino mass hierarchy, we demonstrate 
that the procedure is valid in one scenario, but fails in 
another where known constraints on M 2 2 are taken into 
consideration. In the latter case, the Feldman- Cousins 
method [23| based on Monte Carlo (MC) simulation is 
recommended to obtain valid confidence intervals. 

Consider a spectrum that consists of n energy bins. 
Assume that the expected number of counts in each bin 
is a function of Am§ 2 and a nuisance parameter r\. For 
simplicity, we denote Am\ 2 by 0. Then for the zth bin, let 
/j,i(0,T]) and Ni represent the expected and the observed 
counts of neutrino induced reactions, respectively. When 
/Xj is large enough, the distribution of Ni can be well 
approximated by a Gaussian distribution with mean /Xj 
and standard deviation y/Jil. 

Once the data x = {N,i = l,...,n} are observed, 
the deviations from the expected values {/j,i(9,rf),i = 
l,...,n} are often calculated to help measure the im- 
plausibility of the parameter (#,77). Specifically, when 
the systematic uncertainties are omitted, and that cer- 
tain available knowledge concerning the parameters 8 and 
r\ are taken into consideration, one useful definition of the 
deviation is given by 
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Here, the general notation Sw represents the standard deviation of a variable w. So SN = i/fli, and the cor- 



3 



responding xttat term is called the Pearson's ch i- sq uare. 
Also, note that \9\ = Mf 2 , and it is taken from [22| that 
|0 O | = 2.43 x 10~ 3 eV 2 and 5\6\ = 0.13 x 10~ 3 eV 2 . 

Based on Eq. [2] and a standard procedure discussed 
in Ref. 22[, confidence intervals can be obtained for the 
parameter of interest 9 (Am| 2 ), the sign of which is an 
indicator of the neutrino MH. First, define m ; n to be 
the best fit to the data in the sense that (# m i n , r] min ) = 
arg mine jl; x 2 ($j V) where the minimum is taken over x 
H, the space of all possible values of {9,rf). Here, the 
general notation argmin^, h(w) denotes the value of w 
which corresponds to the minimum of the given function 
h. Note that # m ; n suggested by the observed data set 
will not be exactly the true value of the parameter 9, 
and a repetition of the experiment would yield a data 
set that corresponds to a different 9 m - m . So instead of 
reporting only # m in, it is more rational to report a set of 
probable values of 9 that fit the observed data not too 
much worse than that of the best fit, and state how trust 
worthy the set is. Indeed, for any given 9, let rj m i n (9) = 
argmin l; y 2 {9, 77), and define 



A x 2 mm (O) = x 2 (0,Vrmn(9))-x 2 



), (3) 



then a level a confidence interval based on Eq.[3]is defined 
to be 



C a = {0eQ: A x 2 mm (9) <U 



(4) 



where we use the standard set-builder notation {h(w) : 
restriction w} to denote a set that is made up of all 
the points h(w) such that w satisfies the restriction to 
the right of the colon. The key in constructing Eq. [4] 
is to specify the correct threshold value t a for a given 
confidence level a. (See the final paragraph of this sec- 
tion for a more detailed description of what confidence 
level means.) Most commonly examined confidence lev- 
els use a = 68.27%(lcr), 95.45%(2cr), 99.73%(3cr), which 
are often linked to threshold values t a = 1, 4, 9 respec- 
tively [22J. Note that these three values are the 68.27%, 
95.45% and 99.73% quantiles of the chi-square distribu- 
tion with one degree of freedom, respectively. They are 
used as threshold values because the parameter space 
is of dimension one and that, under certain regularity 
conditions, Ax 2 n i n {9) would follow approximately a chi- 
square distribution with one degree of freedom when 9 is 
the true parameter value. This procedure and its exten- 
sions to cases where 9 is of higher dimension have been 
successfully applied in many studies [ll], [H, [25], [27h[33l | 
in order to constrain various parameters in the neutrino 
physics. 

Although this procedure has been widely used in ana- 
lyzing experimental data, note that it is not universally 
applicable. Its limitations has been addressed by Feld- 
man and Cousins [23| . Below, we illustrate this point 
through a simple MC simulation study. It will be shown 
that, in a situation that is similar in nature to the MH 
determination problem in Eq. [1] where there exist special 
constraints on the possible values of the aforemen- 



tioned threshold values based on chi-square approxima- 
tion could result in bad confidence intervals. That is, 
the actual coverage probabilities of the intervals strongly 
disagree with their nominal levels. 

In the simulation, we set n = 10, Hi{9) = 1000 + 15-0 
for i = 1, . . . , n. (Here, no nuisance parameter 7/ is in- 
troduced, and all the expected bin counts are assumed 
equal for simplicity. Nevertheless, these assumptions are 
not essential to the purpose of our simulation.) The fol- 
lowing two cases are investigated: 



• Case I: = (—00, 00), 

• Case II: = {-1,1}. 



Case I is a typical situation where nothing was known 
about 9 before the current experiment, whereas case II 
is designed to imitate the situation where existing mea- 
surements of \9\ = Af| 2 are very accurate at around 
2.43 x 10 -3 eV 2 , and we simply denoted this value 
to 1 for clarity of presentation. Further, the defini- 
tion of deviation analogue to Eq. [2] is taken to be 

X 2 (9) = J2i ^ Ni ^(e)^ f° r casc I- For case II, the chi- 

square definition is X 2 (0) = Ei + Iff 

with experimental constrains on \9\. It is then reduced 



to x 2 (0) = Y, 



(Af.-M9)r 



with 9 being only 1 or -1. 



Under each case, we set the true value of 9 to be 
#0 = 1, based on which 100,000 MC samples are simu- 
lated, denoted by {N^\ • • • , N$} for j = 1, . . . , 100000. 
Then for the jth sample, confidence intervals of levels 
a = 68.27%, 95.45%, 99.73% are constructed according 
to Eq. [4] using threshold values 1, 4, 9, respectively. Fi- 
nally, at each of the three levels, we record the proportion 
of confidence intervals out of the 100,000 that include the 
truth 9q = 1. The results are reported in the last three 
columns of Table [H It can be seen that, in case I, the 
actual coverage probabilities closely match the nominal 
levels. However, in case II, the actual coverage probabil- 
ities are always higher! 

Without too much technical detail, we try to explain 
the reason why the chi-square procedure produced valid 
confidence intervals for case I, but not for case II. In 
general, having observed data x from a parametric model 
P(x\9), a sensible test for a pair of hypotheses, Ho : 9 E 
O and Hi : 9 <G — 0o (the counterpart of H ), is the 
likelihood ratio test that is based on the test statistic 
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P(x\9 ^ in ) 
P(x\9 min ) 



(5) 



where 6> , m in = argmin {eeeo} P(x\6), and 9 min = 
argmin{0 6 0} P(x\9) are the best fit over the null param- 
eter set ©o and the full parameter set 0, respectively. If 
the observed data x yields a large Ax^ in , it means that 
©o is implausible, which further leads to the rejection 
of H . Note that, the statistic Ax 2 nin (6') in Eq. [3] is a 
special case of Eq. [5] with ©o consisting of a single point, 
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Case 






distribution 


AXmin — 1 


Ay 2 < 4 


Ay 2 < 9 




distribution 


distribution 


parameter 


confidence 


confidence 


confidence 








within this example 


level 


level 


level 


I 


Chi-square 


Gaussian 


mean = 1 and a = 0.67 


68.27% 


95.48% 


99.73% 


II 




Bernoulli 


p = 0.0679 


95.12% 


98.48% 


99.86% 



TABLE I: Confidence levels for various of Axmin region for the Gaussian and the Bernoulli distribution from MC. In Case I, 
the mean and the standard deviation of the Gaussian distribution is found to be about 1 and a = 0.67 respectively. In Case 
II, the parameter p of the Bernoulli distribution (e.g. percentage of 9 m in < 0) is is found to be about 6.8%. 



In order to determine the correct threshold values 
in rejecting, or equivalently, in constructing confidence 
intervals defined by Eq. 21 the distribution/quantilcs 
of Ax?, lin {6o) considering all possible data set needs 
to be known, when the true parameter value is some 
6*o € On- An important result in statistics, Wilks Theo- 
rem [H, states that, under certain regularity condi- 
tions, Axmin(@o) follows approximately a chi-square dis- 
tribution with degree of freedom equal to the difference 
between the dimension of and that of Oo, when the 
data size is large. (In our problem, the data size is sim- 
ply y] ■ Nj.) The main regularity conditions are, as we 
quote [35|], "the model is differentiable in 9 and that Oo 
and 9 are (locally) equal to linear spaces" . Essentially, 
such conditions imply that 9 mm follows an approximately 
Gaussian distribution centered at the true 9 value, which 
eventually implies an approximate chi-square distribu- 
tion for Ax m ^ n 

In case I of our simulation, the best estimation of 9 can 
be calculated directly from the number of events in each 
bin: 9 min = N i/ n ~ 100 °] / 15 1 -The aforemen- 

tioned regularity conditions are satisfied in this case, and 
the distribution of min and that of Ax^ nin (9o) follow ap- 
proximately the Gaussian and the chi-square distribution 
respectively as what Wilks theorem predicts. In the top 
two panels of Fig. [TJ we reconfirm this fact by compar- 
ing their histograms based on the 100,000 MC samples 
(black shaded area) to the probability density function of 
the Gaussian and the chi-square distribution (blue long 
dash-dotted line). On the other hand, the full parame- 
ter space in case II consists of two isolated points and 
clearly violates the conditions required by Wilks theo- 
rem. Indeed, in case II, the best estimation of 9 is given 
by 



l if x 2 (# = i)<x 2 ( 

— 1 otherwise 



■1 



follows a Bernoulli distribution, and Ax„ in (So) follows 
a distribution quite different from a canonical chi-square 



distribution. Approximations to the actual distributions 
of Ax 2 nin (^o) and 9 min can be obtained from the 100,000 
MC samples, and are shown (black shaded area) in the 
bottom two panels of Fig. [TJ Further, an analytical ap- 
proximation (red dash-dot-dotted line) to the distribu- 
tion is derived in Appendix A. The analytical calculation 
implies that, independent of whether the truth 9q is 1 or 
— 1, the p- value 2 corresponding to an observed value of 
AXmm^o), sa y t> is approximately given by 



p-value(t) = P(A X 2 min (0 O ) > *) 



1 1 , / t + A X ' 
- - -erf 

2 2 



8A X 2 



(6) 

for any t > 0; and the p- value is 1 for any t < 0. Here 
erf is the Gaussian error function: erf(x) = ^7= J Q e~* dt. 

(We use the general notation P(A) to denote the proba- 
bility of an event A.) 

The discussions above suggest that, when constructing 
confidence intervals in special cases where conditions of 
Wilks theorem do not hold (or that the user can not 
be sure if the conditions hold), the regular threshold 
values (such as t a = 1, 4, 9 mentioned earlier) should 
not be taken for granted. Instead, alternative thresholds 
based on MC or case-specific analytical approximations 
are needed. We recommend using the MC method with a 
large MC sample size whenever possible, because unlike 
other methods, it is guaranteed to produce a valid con- 
fidence interval for 9. We hereby review how to produce 
a valid l-er (68.27%) confidence interval for 9 using the 
MC method [23[ . This method can easily be generalized 
to build confidence intervals of any level. 

• Having observed data x = {Ni, ■ • • , N n }, apply the 
following procedure to every 9 in the parameter 
space (fix one 9 at a time): 

1. Calculate Axmin(@) x with Eq.|3] based on the 
observed data. 

2. Simulate a large number of MC samples, say 
where a#) = {N[ j \--- ,N^ ] ] is 



Note that the above m j n can be closely approximated by 
[QZ2=i Ni)/n — IOOO] /15, which is indeed the exact maximum 
likelihood estimator for 9 had we assumed that each count Ni 
follows a Poisson distribution with mean fii(0) = 1000 + 15-6. 



2 The p-value at t is defined to be the percentage of potential 
measurements that result in the same or a more extreme value 
of the test statistic, say ^Xmin' than i. 
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Case I: Normal Case I: Normal 




FIG. 1: (color online) Distributions of Axmini^o) and 6 m in for case I and case II with 100,000 MC samples. The 6 m in 
distribution of case I (top right) and case II (bottom right) are a Gaussian and a Bernoulli distribution, respectively. The 
^Xmm(^o) distribution of case I (top left) is consistent with the chi-square distribution with degree of freedom one. The 
commonly used la, (68.27% confidence level) and 2a, (95.45% confidence level) regions are labelled with red dashed and 
black dash-dotted lines for case I. The Axmin(@o) distribution of case II (bottom left) strongly deviates from the chi-square 
distribution. In case II, we also show the analytical approximation (derived in Appendix. of the distribution of Axmin- 
We should emphasize while chi-square distribution does not depend on any additional parameter (other than Axmi„), the 
analytical approximation depends on Ax 2 - 



generated from the model with true param- 
eter value 9. For j = 1,...,T, calculate 
A Xmm( S ) (i) i that is Eq. E] based on the jth 
MC sample x*^ . This produces an empirical 
distribution of the statistic A^ in (^). 

3. Calculate the percentage of MC samples such 
that A^ fa (9)M < A X 2 min (6r. Then 9 is 
included in the 1-er confidence interval if and 
only if the percentage is smaller than 68.27%. 

One can easily check that p- values analytically obtained 
from Eq. [5] for case II (basically the MC method) are 



consistent with the simulation results listed in Table HI 

On a separate issue that was also emphasized in 
Ref. [23[ , classical confidence intervals should not be con- 
fused with Bayesian credible intervals. On one hand, 
the confidence-level of a confidence interval, say a, is 
an evaluation of this interval estimation procedure based 
on many potential repetitions of the experiment. More 
specifically, had the experiment been independently re- 
peated 100 times, applying the estimation procedure to 
each would result in 100 intervals, and a represents the 
proportion of these intervals that we expect to contain 
the true value of the unknown parameter 9. The level- 
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a confidence interval reported in practice is the result 
of applying such a procedure to the data observed in the 
current experiment. On the other hand, a Bayesian cred- 
ible interval, say of credible-level b, is a region in the 
parameter space such that, given the observed data, it 
contains the true value of the unknown parameter with 
probability b. In general, an a-level confidence interval 
does not coincide with an a-level Bayesian credible inter- 
val. In other words, if C a is an a-level confidence interval 
built from the observed data x, then it is generally inap- 
propriate to give the interpretation that P(9 G C a \x) (the 
probability of true 9 inside C a given data x) is a. Nev- 
ertheless, in Appendix [Bl we discuss when confidence in- 
tervals approximately match Bayesian credible intervals. 
In the next section, we present a Bayesian approach to 
the problem of determining neutrino mass hierarchy. 



III. A BAYESIAN APPROACH TO 
DETERMINE NEUTRINO MASS HIERARCHY 

A. Bayesian inference based on observed data 

The MH determination problem is concerned with 
comparing two competing models, NH and IH, hav- 
ing observed data x. The Bayesian approach to the 
problem is based on the probabilities that each model 
is true given x, namely, P(NH\x) and P(IH\x) = 
I — P(NH\x). (In general, we adopt the notation 
P(A\Bi, - ■ ■ ,B n ) to represent the probability of event 
A given events Bi,--- ,B n . Also, we use capital letters 
such as Si,-- - ,S n and T to denote random variables, 
and use small letters such as s\, ■ ■ ■ ,s n and t to denote 
numbers inside the range of possible values of the ran- 
dom variables. Then Pt\s± .S„(^l s ii ' • • j s n) denotes for 
the conditional probability density function (pdf ) or the 
conditional probability mass function (pmf ) given events 
Si,-- - , S n = Si,-- - ,s n . The subscript to P is often 
omitted when it is clear what random variable is be- 
ing considered.) Model NH will be preferred over IH 
if the odds r(x) = P{IH\x) / P{N H\x) < 1. Moreover, 
the size of r serves as an easy-to-understand measure for 
the amount of certainty of this preference. Alternatively, 
some people may feel more comfortable in interpreting 
P(NH\x) = 1/(1 + r(x)) directly. 

One can determine P(NH\x) and P(IH\x) within a 
Bayesian framework as follows. Let the true value of MH 
be either NH or IH, and let the counts Ni follow a Gaus- 
sian distribution with mean /z^ IH (0, ??mh) and standard 
deviation \J ix^ m {9, ?7mh) for i = 1, • • • ,n. Here, 9 is the 
parameter of interest, and ?7mh denotes other unknown 
nuisance parameter(s). Here, a subscript accompanies r\ 
to emphasize that the nuisance parameter is allowed to 
have different interpretations and behavior under the two 
hypotheses. (We will omit this subscript whenever there 
is no possibility of confusion.) If prior knowledge is avail- 
able for 9 and rj, then they should be elicited to form prior 
distributions, P(9,r]\MH) for MH=IH, NH. Sometimes, 



it is reasonable to assume that the parameters 9 and r\ are 
independent conditional on MH, hence P(0,rj\MH) = 
P(ti\MH)P(9\t 1 ,MH) = P{r]\MH)P{6\MH). 

Specific to the MH problem at hand, under NH (IH), 
previous knowledge (e.g., from [23 |) suggests that a sen- 
sible prior for 9 would be a Gaussian with mean 2.43 x 
1(T 3 cV 2 (-2.43 x 10~ 3 eV 2 ) and standard deviation 
0.13 x 10~ 3 eV 2 . Since the hypotheses being tested are 
NH : 9 G Q NH = (0, 00) versus IH : 6 G B IH = (-00, 0), 
P(9\NH) and P(9\IH) are specified to be the truncated 
version of the above Gaussian distributions supported 
within &nh and 0/£r> respectively. Nevertheless, in our 
Bayesian model, P(9 G Qih\NH) and P(9 G Snh\IH) 
based on the Gaussian prior are so tiny that they will 
yield the same numerical results as the truncated ver- 
sion. Similar choice can be made for P(rj\MH). 

According to Bayes' theorem, we have 



P(NH\x) 



P(x\NH) -P(NH) 

W) 

P(x\NH) 



(7) 



P(NH) 



P{x\NH) ■ P{NH) + P(x\IH) ■ P(IH) 



Here, P(NH) and P(IH) = 1 - P(NH) should reflect 
one's knowledge in NH and IH prior to the experiment. In 
the MH problem, it is reasonable to assume that NH and 
IH are equally likely, that is P(NH) = P{IH) = 50%. 
We will make this assumption throughout the paper. 
Consequently, Eq. [7] reduces to 



P(NH\x) 



P(x\NH) 



P{x\NH) + P(x\IH)' 



(8) 



Based on probability theory, P(x\MH), i.e. the 
likelihood of model MH, is a "weighted average" of 
P(x\9, r), MH) over all possible values of (9, rf): 



P(x\MH) 



Hmh ^©mh 



P(rj\MH)P(9\r), MH)P(x\9, 77, MH)d9dr] , 

(9) 



in which Hmh represents the phase space of nuisance pa- 
rameter 77 given the choice of MH. Further, under the 
assumption that 9 and ?y are independent, Eq. [5] is re- 
duced to: 



P(x\MH) 



Hmh J ©mh 



P(r)\MH)P(O\MH)P(x\0, rj, MH)d9dr] . 

(10) 



In practice, the integral in Eq. [9] is often analyti- 
cally intractable, but can be approximated using MC 
methods. Using a basic MC scheme, first, a large 
number of samples {(6^' ,r}^'),j = 1,...,T} are ran- 
domly generated from the prior distribution P(8, r)\MH). 
Then for the observed data x, obtain Pt(x\MH) := 
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Q 


0.475 


1 


1.281 


1.645 


2 


3 


4 


5 


one-sided p-value: p a 


31.74% 


15.87% 


10% 


5% 


2.28% 


0.13% 


3.2e-5 


3.0e-7 


AxL 


1.53 


3.33 


4.39 


5.89 


7.52 


13.29 


20.70 


30.04 



TABLE II: Tabulated results of AxL- For a given a, the one-sided p-value is p a = P(Z > a) (probability of Z > than a) where 
Z stands for a standard Gaussian random variable. The corresponding Ax 2 value is given by Axa CT = — 21og(p a /(l — Pa))- 



T^Yjj^iPiA^^ ,MH). As the MC size T in- 
creases, the estimator Pt{x\MH) will have probabil- 
ity approaching 1 of being arbitrarily close to the true 
P(x\MH). Note that there exist much more efficient 
MC algorithms, such as importance sampling algorithms, 
that require smaller, more affordable T for the resulting 
estimators to achieve the same amount of accuracy as 
that of the basic MC scheme. Interesting readers are 
pointed to [36| for further details and references. 

There also exist (relatively crude) approximations to 
P(x\MH) in Eq.[S]that avoid the intense computation in 
the MC approach. A most commonly used one is the one 
on which a popular model selection criteria, the Bayesian 
information criterion (BIC) is based. This approximation 
is often presented in terms of an approximation to a one- 
to-one transformation of P(x\NH), namely 

A X 2 {x) = -21ogr(x) = -2log(P(IH\x)/P(NH\x)) . 

(11) 



where {0,fj) and (6',ff) denote maximizers of 

P{x\9, T), MH)P{r}\MH)P(6\MH) , MH=NH, IH. 

within their respective range. (Note that 8\ 2 is essen- 
tially an alternative version of Ax^i n in Eq. [3J bear- 
ing some technical difference only.) Here, the term 

y^"_i log is the result of the normalization fac- 

tor (e.g. (27rer 2 )~2) of the Gaussian pdf, and is in gen- 
eral small compared to AT. In the classical testing pro- 
cedure, the observed value of 5\ 2 will be compared to 
its parent distribution to get a p-value. Whereas the 
Bayesian approach described in this section directly in- 
terprets the value of AT, by transforming it to either 
the odds ratio between NH and IH, r(x) = e~ Ax rj 



Denote 



AT(x) = Tih(x)~Tnh(x). 



(12) 



Then if the sample size ^ Ni is large, and rjN h and tjjh 
arc of the same dimension, then 



A x 2 (x) = 2\ogP{x\NH) -2\ogP{x\IH) 
« AT{x) . 



(13) 



Here, the equality follows from Eq. [8] and the approxi- 
mation is supported by a crude Taylor expansion around 
the maximum likelihood estimator for the parameters. 
There arc other approximations that follow the same line, 
that are more accurate but also computationally more 
demanding. See [36| for details. 

One remark should be made regarding AT, as it is 
closely related to a commonly used test statistic in the 
classical testing procedure. Indeed, if the truncated 
Gaussian priors mentioned earlier are assigned for 9 and 
a Gaussian prior with mean 770 and standard deviation 8rj 
is assigned for 77 under both NH and IH, then according 
to the definition of \ 2 i n Eq. HI we have 



5x 2 



X 



x 2 (M) = AT-5>g 



-AT(i)/2 



, or the probability of NH, 



P(NH\x) = 



1 



1 + r(x) 
1 



1 



Tmh(x) = -2log{m&xP(x\6,ri,MH)P(Ti\MH)P(6\MH)}, 

9,rj 

where the maximum is taken over (6, rj) € 6mh x -Hmh 
and 



1 + e -A X 2 (x)/2 1 + e -Ar(x)/2 
and similarly, the probability of IH, 

r(x) 



.(15) 



P(IH\x) = 



1 + r(x) 



-A X 2 (x)/2 



-AT(x)/2 



l + e -A x 2 (x)/2 1 + e -AT(x)/2 



(16) 



(14) 



B. Sensitivity of experiments 

So far, we described the Bayesian procedure for test- 
ing the two hypotheses, NH and IH, given observed data 
x = {Ni,-- - ,N n }. Reasoning backwards, foreseeing 
what analysis will be done after data collection allows 
us to address the question that, before data is collected 
from a proposed experiment, how confidently do we ex- 
pect it to be able to distinguish the two hypotheses NH 
and IH. We loosely refer to such an ability as "sensitiv- 
ity" of the experiment. There could be many ways to 
define sensitivity, and we list a few below. In practice, 
evaluating a proposed experiment using one or several 
of these sensitivity criteria provides views from different 
angles of the potential return from the experiment. 

Note that sensitivity depends on the underlying true 
model as well as future experimental results generated 
from this model. For example, if NH is true, then 
we have a population of potential experimental results 
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x ~ P{x\NH) = J J P(x\9,r],NH)P(9,r]\NH)dOdr). 
And each potential x is associated with a posterior prob- 
ability P(NH\x). Then one could evaluate the ability of 
an experiment to confirm NH when it is truly the under- 
lying model by looking at the distribution of P(NH\x). 
The most typical numerical summaries of this distribu- 
tion include its mean, quantiles and tail probabilities, all 
of which can be used to address sensitivity. 

Below we officially develop metrics for sensitivity under 
the assumption that NH is true. Note that these metrics 
can be similarly defined when IH is true. 

1 . The average posterior probability of NH is given by 



P 



NH 
T=NH 



P{NH\x)P{x\NH)dx 

30 



(17) 



> 1 + e -AxV2 
1 

_^ l + e - A x 2 /2 



P{x\NH)dx 
P(A x 2 \NH)dA x 2 ■ 



Note that the first integral above involves calculat- 
ing an A-dim integral, and the last one is of 1-dim 
only. The latter is much easier to obtain, an exam- 
ple of which will be presented in the next section. 

2. The fraction of measurements x that favor NH, i.e., 
the fraction of x such that P(NH\x) > .5, is given 
by 



F T = 



NH 



{x:P(NH\x)>.5} 



P(x\NH)dx 



/ P{A x 2 \NH)dA X 2 
Jo 



(18) 



Here, "F" and the subscript "T = NH" stand for 
fraction and the NH assumption, respectively. 

If NH is the correct hypothesis, then a good ex- 
periment should have a high probability of pro- 
ducing data that not only favors NH but indeed 
provides substantial evidence for NH. Hence, it is 
useful to generalize the term in Eq. [TS] to gauge 
the chance of P(NH\x) > 1 — p for any thresh- 
old value 1 — p of interest. In particular, physicists 
are familiar with thresholds associated with the so- 
called aa level, with one-sided aa corresponding 
to 1 — p a = 1 — P(Z > a) for a standard Gaussian 
random variable Z 3 . Accordingly, define 



ZTiacr 

r T=NH 



{x:P(NH\x)>l-p a } 
poo 

\ P(A x 2 \NH)dA x 2 



P{x\NH)dx 



(19) 



A list of common aa values, the corresponding p a , 
as well as A Xaa = —2 \og(p a / (l—p a )) are listed in 
Table. M 

3. In addition, probability intervals (PI) for P{NH\x) 
also provide useful information. For example, a 
90% PI is denoted by (Pt= nh , 1), where P^°J NH is 
the 100-90=10th percentile of P(NH\x). That is, 
had NH been the truth, 90% of the potential data 
would yield P{NH\x) larger than Pt= nh - 

All the above criteria reflect the capability of the experi- 
ment to distinguish the two competing hypotheses, and 
they convey different messages. 

Finally, to get a complete picture of the sensitivity of 
an experiment, one should also obtain the above met- 
rics under the assumption that IH is the underlying true 
model. The sensitivity scores under metrics 2 and 3 can 
be shown to depend on the underlying true model. For 
example, we experimented with simple examples (not 
shown) and observed that in general Ft=nh ^ Ft=ih- 



NH 



-IH 



Whereas for metric 1, we have Pt=nh — Pt=ih as l° n g 
as equal prior probabilities 4 , P(NH) = P(IH), were 
assigned to the two models. This is because 



-NH -IH 

*T=NH *T=IH 

P(NH\x)P{x\NH) - P{IH\x)P{x\IH)dx 

P 2 (x\NH)P(NH) - P 2 (x\IH)P(IH) , 
P(x\NH)P(NH) + P{x\IH)P(IH) 

-fpwm-rwv*-*. 

In the next section, we use an example to show how 
one can easily calculate the posterior probability and the 
sensitivity measurements introduced above. We also con- 
trast the resulting sensitivity measurements to a com- 
monly used quantity that is known as "Ax 2 of Asimov 
data set" . 



IV. ILLUSTRATION OF THE BAYESIAN 
APPROACH IN A CONSTRAINED 
PARAMETER SPACE 

In this section, we consider a situation where 8 can 
only take on two possible values, 1 and —1, which corre- 
spond to the hypotheses NH and IH respectively. This 
simplified setting is motivated by the fact that exist- 
ing measurements of \6\ = M| 2 are very accurate at 
around 2.43 x 10" 3 eV 2 , and we simply denote this value 
to 1 for clarity of presentation. It is a special case of 
the Bayesian treatment in the previous section, where 



3 Another commonly used term is two-sided aa which corresponds 
to 1 - P(\Z\ > a). 



We acknowledge the referee for pointing out this important rela- 
tion. 
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|| NH MC 
5| IHMC 

NH Norm. Approx. 

— IH Norm. Approx. 




FIG. 2: (color online) The probability density functions 
P{Ax 2 \NH) and P(Ax 2 \IH) in the Bernoulli model are 
shown as the solid and dotted lines, respectively. The |Ax 2 | 
is assumed to be 9. 



P(6\NH) and P(9\IH) are assigned degenerate distribu- 
tions at 1 and —1 respectively. That is, P(9 = 1\NH) = 
P(6 = — l\IH) = 1. Further, there is no nuisance pa- 
rameter r). As a result, the expected bin counts will be 
denoted by fif H = /i(l) and [if H = 1) respectively. 

Below, we showcase numerical calculations of vari- 
ous sensitivity criteria for this example. In particu- 
lar, wc introduce approximations that arc simple func- 
tions of a term commonly known as "Ax 2 of Asimov 
data set" in the physics literature. According to the 
definition in Ref. |26( . "the Asimov data set" under 
hypothesis MH is given by x MH = (nf H ,--- ,A*jlf ff ), 
wheic [i i - iii{t> ,T] ) and {t> ,r) ) — 
argmax(g ii; ) P{9, rj\MH) is the prior mode under MH. 
In words, the Asimov data set is the most typical data 
set under the most likely parameter values based on prior 
knowledge subject to the given model. 



Interestingly, A\ 2 is itself often used as a measu re of 
sensitivity. Here, we'll contrast the typical usage of Ax 2 
to that of the sensitivity criteria developed in the previ- 
ous section. More accurate evaluations of these sensitiv- 
ity criteria are also attainable via MC methods. 

Suppose that the proposed experiment will collect 
enough data such that the expected counts under NH 
and IH are much larger than the difference between them: 
fif H ~ >> \^f H — n\ H \- Using the notations intro- 
duced in Sec. UTJ if the nature is NH, then the observed 
counts Ni can be represented as 



N, 



NH 



(20) 



where g\ , ■ ■ ■ ,g n are mutually independent standard 
Gaussian random variables. Then, the statistic Ax 2 of 



Eq. [TT] becomes 



&Xt=nh 



i=l 



IH 



" (,,NH ,.IH\ I ..NH n 

i=l 



I' 



IH 



" ..NH ..IH 

E ft ~ 2 
..IH "i 



J>g(l 



U NH _ IH 
,.IH 



(21) 



Here, the subscript T = NH indicates that the nature 
is NH. Since ^ » \nf H - juf H |, the summation of 
the last two terms in Eq. [5T] is negligible as it is approx- 
imately Y^ii=i Mi ll iH i — ' (<?f — 1) by a Taylor expansion 

of the last term. Therefore, Axt—nh follows a Gaussian 
distribution, with mean and standard deviation: 



( n NH -,, IH V 

1 =1 



Ax 2 
CT Ax 2 



'Ell! 



( l _ l NH_ fl i H y 



« 2^/Ax 2 

In the last step, since nf H 
we further neglect the term 



i4 H « rf H 



(22) 

IH 



IH y 



(A1 m )2 • Similarly, it is 

straightforward to show that when nature is IH, Axt=ih 
would follow an approximate Gaussian distribution with 
mean = —Ax 2 and standard deviation oax 2 - ^ n fact, 



when IH is true, Axj H 



I NH IH\ Z 

L,i=i -pra A X 



To see how the above approximation works, we look 
at the example in Sec. [Hi where Ax 2 ~ 9. Fig. [2] shows 
histograms (shaded area) based on large MC samples of 
Ax 2 under NH and IH respectively. They agree very 
well with the analytical approximation (dashed lines) in 
Eq.[H 

Now, we are ready to calculate (1) the probability of a 
hypothesis post data collection, and (2) various measure- 
ments of sensitivity for an experiment concerning poten- 
tial data generated from it. 

First, given observed data x = (Ni, ■ ■ ■ , N n ), the prob- 
ability P(NH\x) can be directly calculated from Eq. [7] 

(t-m) 2 

Let G(t; m, a) = -j=—e denote the pdf of a Gaus- 



27T-CT 

sian random variable with mean m and standard devia- 
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tion cr, evaluated at t, then 

P(x\NH) ■ P(NH) 



P{NH\x) 



where 
A x 2 (x) 



P(x\NH) ■ P(NH) + P{x\IH) ■ P(IH) 
1 

I +e -A X 2(x)/2 



We mention that, if one reduces the full data x to its 
function A% 2 (x), then calculating P(NH\Ax 2 ) based on 
our approximation in Eq. [22] will recover P(N H\x): 



n 

£ 



r'i 



,IH 



) 2 (*< 



,NH 



21 



,IH 



,NH 



P(NH\A X 2 ) = £i A X 2 \NH)-P(NH) _ P(A X 2 \NH) 



P(Ax 2 ) P(A X 2 \NH) + P(A X 2 \IH) 

G(Ax 2 ;A^,2y/^j 



G ^A x 2 ;A x 2 ,2 ] jAx 2 j + G {A x 2 ; -A X 2 ,2^A X 2 

I 

Next, we evaluate various sensitivity metrics of a fu- for A% 2 in Eq. 
ture experiment, using again the Gaussian distribution 



1 + e-Ax 2 /2 ' ( 23) 



p NH 



T=NH 



pA% 
*T=NH 



1 



l + e -i/2 



G t, Ax 2 ,2\J Ax 2 dt = P{Ax 2 ) 



G\t-Ax 2 ^A x 2 }dt=- | 



A X 2 



G t; A X 2 , 2yJ A X 2 ) dt = - 1 + erf 



1 



AX 2 - A X L 



8Ax 2 



1/ U + e" 



•4 A X 2„2^VAx 2 



(24) 
(25) 

(26) 
(27) 



In Eq. |2~41 above. Pt=nh was approximated by P(Ax 2 ), 
which is a function of Ax 2 only. In Eq. [27] z* A represents 
the ^4th percentile of a standard Gaussian distribution, 

hence Ax 2 — 2z* A \J Ax 2 is the (lOO-A)th percentile of 
A^ 2 according to the Gaussian approximation in Eq. 1221 
Since P(NH\A X 2 ) = 1/(1 + e- A * 2 / 2 ) is increasing in 
Ax 2 , this means that the righthand side of Eq. [27] is 
the (lOO-A)th percentile of P(NH\Ax 2 ), which serves as 
the lower bound of the A% PI proposed in the previous 
section. In Table. IIII1 we list z A for a few typical 



choices of probability intervals, assuming that the nature 
is NH. 

For the example experiment used in the simulation 
of section [n] its Ax 2 = 9. Had one followed com- 
mon practice that directly compares *J Ax 2 to the quan- 
tiles of a Gaussian distribution, one would report the 
"specificity" of the experiment to be 99.87% (1 - "one- 
sided p-value"). In contrast, we obtained various sensi- 
tivity metrics for the experiment according to Eq. 1241 
|2"T1 and listed them in Table [IV] First, assuming the 
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For an experiment with A % =9 




0.4 0.6 0.8 1 
P (bin size 0.001) 



1 - Probability 
1 



Gaussian 

Average Probability P 
90% P.I. 




10 20 30 40 50 



FIG. 3: (color online) The left panel shows the distribution of P(NH\x) = P(NH\ Ax 2 ) over the population of potential data 
x that arises from an experiment with Ax 2 = 9 where the truth is NH. The mean of this distribution is 90.14%. Lower bound 
of the 68% and 90% probability intervals are plotted. That is, 68% (90%) of the data x would yield a P(NH\x) that falls to 
the right of the dash-dotted (dashed) line. These two lines are also commonly referred as the 32th and the 10th percentile. 
The right panel plots several sensitivity metrics (subtracted from 1 for clarity), against Ax 2 that ranges from 1 to 50. Note 
that all the lines are decreasing because higher values of Ax 2 corresponds to more sensitive experiments. This is done for three 
different criteria: the Gaussian interpretation (derived from the one-sided p- value with one degree of freedom), P and P^?J° NH . 
The Gaussian interpretation is seen to be over-optimistic in describing the ability of the experiment to differentiate the two 
hypotheses. 



A% 


68% 


90% 


95% 


99% 


Gaussian Percentile z* A 


0.468 


1.282 


1.645 


2.326 



TABLE III: Tabulated Ax%i values for a few typical choice of probability intervals, assuming that nature is NH. 



"Asimov data set" is observed, we have P(NH\x NH ) « 
P(IH\x IH ) rj P(NH\A X 2 = 9) = 98.90%. Secondly, we 

calculated P~t=nh = PtLih ~ P(A)? = 9) = 90.14%. 
That is, the average posterior probability for NH (or IH) 
when it is indeed the correct hypothesis is only about 
90%, which is much lower than its Asimov counterpart 
of P(NH\A X 2 = 9) = 98.90%. Thirdly, the fraction 
Ft=nh — 93.32% of potential data sets would yield a 
Ax 2 that favors NH. And to contrast with the Gaussian 
interpretation, we calculated that only F:$L NH = 23.73% 
of potential data sets would yield a A% 2 above 9, or say, 
yield an evidence as strong as P(NH\x) > 99.87%. Fur- 
ther, the left panel of Fig. [3] displays the distribution 
(vertical axis in log scale) of P(NH\x) = P(NH\Ax 2 )- 
The two vertical dashed lines show that 68% of potential 
data sets will result in P(NH\x) > 95.67%, whereas 90% 
of potential data sets will result in P(NH\x) > 65.79%. 



Moving forward from a fixed A\ 2 value, we next study 
how the various sensitivity metrics compare to each other 
for experiments with different Ax 2 values. The right 
panel of Fig. [3] displays the lower bound of the 90% prob- 



— NH 



ability interval Pt=nh> tne avera g e probability Px=nh-> 



and the Gaussian interpretation based on one-sided p- 
value as functions of Ax 2 - Note that we plotted 1 mi- 
nus the aforementioned metrics in order to zoom in the 
high probability regions. Interestingly, the line of average 
probability P yields a higher value than the lower bound 
of 90% P.I. for Ax 2 <^ 18, an d yield s a lower value than 
the lower bound of 90% P.I. for 1 8. Such behav- 

ior is natural given the definition of each curve. Never- 
theless, both curves are much higher than the Gaussian 
interpretation, suggesting that the Gaussian interpreta- 
tion is over-optimistic in describing the ability of an ex- 
periment to differentiate NH and IH. 



DISCUSSIONS 



A couple of comments should be made regarding the 



A\ 2 representation for sensitivity in determining the 
MH. 

1. We have seen that the distribution of the best es- 



timator of 6 = Ar 



'■.32 



is closer to a Bernoulli dis- 



tribution than to a Gaussian distribution. There- 
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Symbol 
Description 


P 

Average 


Gaussian Interpretation 


P(NH\x) 
Asimov data set 


Ft=nh 
A X 2 >0 


77l3o- 

r T=NH 

P > 99.87% 


p 68% 
*T=NH 

68% P.I. 


p90% 
*T=NH 

90% P.I. 


Sensitivity Metric 


90.14% 


99.87% 


98.90% 


93.32% 


23.73% 


95.67% 


65.79% 



TABLE IV: Sensitivity metrics for an experiment with A\ 2 = 9. 



fore, Wilks' theorem is not a pplicable, and di- 
rect interpretation of \J ^-Xmin as ^ nc number of 
a in the Gaussian approximation leads to incor- 
rect confidence intervals. We provided an analyt- 
ical formula (Eq. |6|) for confidence interval in an 
ideal Bernoulli case, which can be used to gen- 
erate approximate confidence intervals for similar 
cases. For more general full MC simula- 

tion is needed to construct confidence intervals, as 
advocated in Ref. (23j . 

2. Even if a confidence interval for Am| 2 is con- 
structed correctly, its confidence level can not be 
directly interpreted as how much the current mea- 
surement would favor the NH (IH) against the 
other. Despite possible agreement between con- 
fidence intervals and Baycsian credible intervals 
under certain circumstances as discussed in Ap- 
pendix. [E] such agreement does not apply to the 
current MH problem where there are strong con- 
straints imposed on M| 2 - 

Additional comments should be made regarding the 
Bayesian approach. 

1. In principle, results from different experiments can 
be combined within the Baycsian framework. One 
example can be found in Ref. [37| . in which a 
Bayesian method was applied to constrain #13 and 
CP phase S with existing experimental data. Re- 
garding to the MH, results from different exper- 
iments can be combined through the integral in 
Eq. [SJ Specifically, one can integrate over the 
nuisance parameters regarding experimental sys- 
tematic uncertainties, while leaving nuisance pa- 
rameters regarding the relevant neutrino masses 
and mixing parameters unintegrated. For example, 
suppose there are two independently conducted ex- 
periments, labeled by j = 1,2, and that their re- 
spective observed data Xj corresponds to the model 
P(xj \6, 77* , rjj , MH) under MH=NH or IH. Here the 
vector of nuisance parameter 77 in experiment j is 
separated into two pieces rj* and ijj, where ijj is 
unique to the experiment and 77* is common to both 
experiments. Of course, 9 is the parameter of in- 
terest and hence always common to both. Then, it 
would be useful for the different experiments to not 
only present A% 2 (Eq. ITT1) , but to also present 

P( Xj \6,r)*,MH)= f P^Orf ^^MH^Wrf ,MH)dr,, 



in order that one can calculate the overall likeli- 
hood P(xt,x 2 \0,T]*,NH) = U^PixjlO^^NH) 
for further inferences. 

2. We have listed a few different metrics to represent 
sensitivity of future experiments in Sec. lIIII Each of 
them convey different information. In the case that 
one has to choose a single number to summarize the 
experiment sensitivity, one convenient choice would 

be P = Pt=nh = Pt=ihi the average probability 
reported for the true underlying model . For all 
other metrics that were introduced, the sensitivity 
scores need to be calculated separately assuming 
NH or IH is the true model. 

3. For general models where nuisance parameters are 
present, it is possible to measure the specificity 
of an experiment conditional on different possi- 
ble values of the nuisance parameters. For in- 
stance, suppose NH, and that a particular value 
of the nuisance parameter, say 77 = 770 1 is true. 
Then the relevant population of potential ex- 
perimental results consists of x generated from 
P(x\NH,t] ) = J P(x\9,r]o,NH)P(9,7] \NH)d9. 
Accordingly, P(NH\x) can be obtained for each a; 
in this population with Eq. |5] 5 , and for e.g., their 

mean Pt=nh( t 1o) and quantiles P^Nnivo) serve 
as more refined sensitivity metrics for the experi- 
ment, and can be plotted against a range of possible 
?7o values. Such application is particularly useful 
when the separation of MH strongly depends on 
the value of 77. One such example is long baseline 
v e or v e appearance measurements (from or 
beam), in which the sensitivity of MH strongly de- 
pends on the value of CP phase of lepton section 
Sep and neutrino mixing angle $23- 

4. The Gaussian approximation in Eq. [52] allows an- 
alytical calculation of various sensitivity metrics. 
Be aware that such calculations arc valid under the 
assumption that the possible range of 9 under ei- 
ther hypothesis is narrow enough that it can be 
reasonably represented by a single point, and that 
nf H — [i\ H « fif H ~ n\ H . For more general 
cases, numerical such as MC methods arc needed. 



One should not take into account the information of 7? = rjo 
in calculating the probability, since one does not know the true 
'value of r\ when analyzing experimental data. 
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5. Finally, we emphasize that sensitivity metrics are 
designed to evaluate an experiment in its planning 
stage. It can be used to see if an experiment with a 
proposed sample size, i.e., the expected bin counts 
{fii, i = 1, • ■ ■ , n}, will be large enough to have a 
high probability of generating desired strength of 
evidence to support the true hypothesis. But once 
the data are observed, the calculation of sensitivity 
metrics is no longer relevant. One should clearly 
differentiate results deduced from data from that 
from the sensitivity calculations. 

VI. SUMMARY 

In this paper, we perform a statistical analysis for 
the problem of determining the neutrino mass hierarchy. 
A classical method of presenting experimental results is 
examined. Such method produces confidence intervals 
through the parameter estimation of Am^ based on ap- 
proximating the distribution of y A\ 2 as the standard 
Gaussian distribution. However, due to strong existing 
experimental constraints of M| 2 = |Ato§ 2 |, the parent 
distribution of the best estimation of Am^ is better 
approximated as a Bernoulli distribution rather than a 
Gaussian distribution, which leads to a very different es- 
timation of the confidence level. The importance of using 
the Feldman- Cousins approach to determine the confi- 
dence interval is emphasized. 

In addition, the classical method is shown to be inad- 
equate to convey the message of how much results from 
an experiment favor one hypothesis than the other, as 
the agreement between the confidence interval and the 
Bayesian credible interval also breaks down due to the 
constraints on M| 2 . 

We therefore introduce the Bayesian approach to quan- 
tify the probability of MH. We further extend the dis- 
cussion to quantify experimental sensitivities of future 
measurements. 
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Appendix A: Derivation of P(Axmin) f° r case II: 

e = {-i,i} 

Let 6q denote the true parameter value from which the 
data are generated. Under case II, when 9q = 1, the 
statistic Axmm(^o) in Eq. [3] is directly related to Ax 2 
in Eq. [5T] (recall that the notation 8 = 1,-1 refers to 
NH and IH, respectively) as Ax 2 lin (l) = max{0, — Ax 2 }. 
The result in sectionllVlimplies that, under 9 Q = 1, — Ax 2 
follows an approximately Gaussian distribution with 

mean — Ax 2 and standard deviation 2\J A\ 2 ■ Similarly, 

when 9 = —1, the statistic Ax 2 rai „(0o) = max{0, Ax 2 }, 
where A% 2 follows approximately Gaussian distribution 

with mean A% 2 and standard deviation 2\J Ax 2 - There- 
fore, whether the truth is do is 1 or — 1, the distribution 
of AxLJ^o) is such that P(A X 2 mm (e ) > t) = 1 for 

t < 0, and that P(A X 2 min {9 ) > t) « \ - |erf (^^j 

for t > 0. 



Appendix B: Confidence Interval vs. Bayesian 
Credible Interval 

As emphasized in Ref. [23], the classical confidence in- 
terval should not be confused with the Bayesian credible 
interval. However, it is rather common that physicists 
approximate the confidence interval as the Bayesian cred- 
ible interval, especially in MC simulations, where previ- 
ous measurements of some physics quantities are used as 
inputs. Such approximations turn out to be acceptable 
under the following condition. 

Consider the condition that the pdf (or pmf) of the 
best estimation of the unknown parameter 9 m i n only de- 
pends on its relative location with respect to the true 
parameter value, that is, 

-fe m i n |e tl -uo(^min|#true) = M^min — #truc), (Bl) 

for some non-negative function h such that h{t)dt = 
1 . Models that satisfy Eq. IB1I are said to belong to a 
location family, where #truc is called the location param- 
eter. When there is a lack of strong prior information for 
^truc it is usually reasonable to assign a uniform prior 
for it, that is, to assign -Pe tr uc ($true) oc 1. If so, we have 
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P@tme |0mm (^truc I #min) 

= ^e m i„ | etruo (^min | ^true)-Pe tr uo (#true ) / P& mia (#min) 

01 p e mi „|e tr „c (^min|^true)-Pe t r U e (^truo) (as a function of 9) 

^ ^e m in|e t rue(^minPtrue) = M^min — #true) • 



In the above, the first step follows from the Bayes' theo- 
rem, and the third step incorporates the uniform prior on 
tr ue- Since for any fixed truc , h(9 min ~9 truc )d9 min = 
1, the above indeed implies that 



P, 



e trU o|e„ 



inl^truc) — ^(^min ^true)- 



(B2) 



For any threshold level c and the observed value of 
#min, define a plausible region for 9tme by A(0 m i n ,c) = 
{9 : P& min \e tIue (e m in\e) > c}, then 



A(9 min ,c) = {9:h(9 min -9)>c} 



--{Ov 



t : h(0 -t)>c} = 



A(Q,c), 



(B3) 



r 



where the transformation t = 9 — 6 m m is used in step 2, 
and in general, the notation a + A for a point a and 
a set A represents the set that consists of points a + a 
for all a <G A. In words, Eq. IB3I says that the plausible 
regions based on different 9 m [ n with a fixed threshold c 
are simply shifts in location of each other. First, under 
the Baycs framework, A(6 m i n ,c) can be considered as a 
credible region (most often an interval). The probability 
that 9 falls in A(9 m i n , c) is called the level of the credible 
region, and is given by 



p e truo \e mi A d G A(9 mhl ,c)\9 min ) 
(by Eq.|B2]and Eq.[B3j 
(letting t = 6- 9 min ) 



A(8„ 



^e true |e min (0|0min)d0 



h{9 - 9 min )d9 



0min + A(O,c) 

h{t)dt. 

A(0,c) 



r 



On the other hand, under the classical framework, which is given by 
A(9 min ,c) serves as a confidence interval, the level of 

I 



(by Eq. [B3 



(by Eq. IBIp 

(letting t = 9 min - 9) 



P @mial@tiae (6€A(9 min ,c)\e) 
=^e min |e ttue (0 e 9 min + A(0,c)\e) 
=Pe iain \e t ^(Kin&0-A(0,c)\6) 

h(9 min - 9)d9 min 

8~A(0,c) 

h(t)dt. 

A(0,c) 



111 summary, the region ^4($min,c) can be interpreted 
both as a confidence interval and a credible region of 
the same level. 

A most useful special case where Eq. IB 1 1 is satisfied is 
the case where 9 m i n strictly follows a Gaussian distribu- 



tion with mean 9 trU e (such as Case I of section [TTJ) and 
that the standard deviation of the Gaussian distribution 
did not depend on 9 true- As we mentioned in section UTT 
it is shown by Wilks [34| that, based on a large data 
sample size, the statistic 9 m i n does approximately follow 
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a Gaussian distribution with mean at Otme under certain 
regular conditions. Hence, it is not unacceptable to con- 
struct an a level confidence interval and interpret it as an 
a level credible interval, as long as the standard deviation 
of the Gaussian distribution has weak or no dependence 

on "true- 

However, in the MH determination problem, the reg- 
ularity conditions are violated due to the existing exper- 
imental constraints on \9\ = M| 2 . As a result, condi- 
tion Eq. IB 1 1 is far from being satisfied, and there is no 
longer a correspondence between confidence intervals and 
Bayesian credible intervals. Indeed, strong inconsistency 
between implications of the two types of intervals can 
be seen from the following specific example belonging to 
case II of section HU It is easy to come up with an ob- 
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